AITopics | data-centric machine learning

Collaborating Authors

data-centric machine learning

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

605bbd006beee7e0589a51d6a50dcae1-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-15-2026, 02:37:26 GMT

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
(9 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.93)

Industry:

Law (0.67)
Energy (0.46)
Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

Add feedback

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Neural Information Processing SystemsOct-10-2025, 04:11:38 GMT

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia.

dataset, documentation, evaluation, (15 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
(9 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.93)

Industry:

Law (0.67)
Energy (0.46)
Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(3 more...)

Add feedback

Towards Data-centric Machine Learning on Directed Graphs: a Survey

Sun, Henan, Li, Xunkai, Su, Daohan, Han, Junyi, Li, Rong-Hua, Wang, Guoren

arXiv.org Artificial IntelligenceDec-11-2024

In recent years, Graph Neural Networks (GNNs) have made significant advances in processing structured data. However, most of them primarily adopted a model-centric approach, which simplifies graphs by converting them into undirected formats and emphasizes model designs. This approach is inherently limited in real-world applications due to the unavoidable information loss in simple undirected graphs and the model optimization challenges that arise when exceeding the upper bounds of this sub-optimal data representational capacity. As a result, there has been a shift toward data-centric methods that prioritize improving graph quality and representation. Specifically, various types of graphs can be derived from naturally structured data, including heterogeneous graphs, hypergraphs, and directed graphs. Among these, directed graphs offer distinct advantages in topological systems by modeling causal relationships, and directed GNNs have been extensively studied in recent years. However, a comprehensive survey of this emerging topic is still lacking. Therefore, we aim to provide a comprehensive review of directed graph learning, with a particular focus on a data-centric perspective. Specifically, we first introduce a novel taxonomy for existing studies. Subsequently, we re-examine these methods from the data-centric perspective, with an emphasis on understanding and improving data representation. It demonstrates that a deep understanding of directed graphs and their quality plays a crucial role in model performance. Additionally, we explore the diverse applications of directed GNNs across 10+ domains, highlighting their broad applicability. Finally, we identify key opportunities and challenges within the field, offering insights that can guide future research and development in directed graph learning.

graph, graph neural network, representation, (14 more...)

arXiv.org Artificial Intelligence

2412.01849

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(4 more...)

Genre: Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Neurology (1.00)
Transportation (0.93)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

The Re-Label Method For Data-Centric Machine Learning

Guo, Tong

arXiv.org Artificial IntelligenceJan-14-2024

In industry deep learning application, our manually labeled data has a certain number of noisy data. To solve this problem and achieve more than 90 score in dev dataset, we present a simple method to find the noisy data and re-label the noisy data by human, given the model predictions as references in human labeling. In this paper, we illustrate our idea for a broad set of deep learning tasks, includes classification, sequence tagging, object detection, sequence generation, click-through rate prediction. The dev dataset evaluation results and human evaluation results verify our idea.

model prediction, noisy data, prediction, (11 more...)

arXiv.org Artificial Intelligence

2302.04391

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Detecting Manufacturing Defects in PCBs via Data-Centric Machine Learning on Solder Paste Inspection Features

Prasad-Rao, Jubilee, Heidary, Roohollah, Williams, Jesse

arXiv.org Artificial IntelligenceSep-6-2023

Automated detection of defects in Printed Circuit Board (PCB) manufacturing using Solder Paste Inspection (SPI) and Automated Optical Inspection (AOI) machines can help improve operational efficiency and significantly reduce the need for manual intervention. In this paper, using SPI-extracted features of 6 million pins, we demonstrate a data-centric approach to train Machine Learning (ML) models to detect PCB defects at three stages of PCB manufacturing. The 6 million PCB pins correspond to 2 million components that belong to 15,387 PCBs. Using a base extreme gradient boosting (XGBoost) ML model, we iterate on the data pre-processing step to improve detection performance. Combining pin-level SPI features using component and PCB IDs, we developed training instances also at the component and PCB level. This allows the ML model to capture any inter-pin, inter-component, or spatial effects that may not be apparent at the pin level. Models are trained at the pin, component, and PCB levels, and the detection results from the different models are combined to identify defective components.

data-centric machine learning, detecting manufacturing defect, solder paste inspection feature, (1 more...)

arXiv.org Artificial Intelligence

2309.03113

Genre: Research Report (0.40)

Industry: Law > Torts Law (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.53)

Add feedback

Data-Centric Machine Learning: Building Shopify Inbox's Message Classification Model

#artificialintelligenceJul-22-2022, 01:10:32 GMT

Shopify Inbox is a single business chat app that manages all Shopify merchants' customer communications in one place, and turns chats into conversions. As we were building the product it was essential for us to understand how our merchants' customers were using chat applications. Were they reaching out looking for product recommendations? Wondering if an item would ship to their destination? Or were they just saying hello? With this information we could help merchants prioritize responses that would convert into sales and guide our product team on what functionality to build next.

shopify inbox, taxonomy, training data, (11 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Data-Centric Machine Learning in the Legal Domain

Westermann, Hannes, Savelka, Jaromir, Walker, Vern R., Ashley, Kevin D., Benyekhlef, Karim

arXiv.org Artificial IntelligenceJan-17-2022

Machine learning research typically starts with a fixed data set created early in the process. The focus of the experiments is finding a model and training procedure that result in the best possible performance in terms of some selected evaluation metric. This paper explores how changes in a data set influence the measured performance of a model. Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance of a trained deep learning classifier. We assess the overall performance (weighted average) as well as the per-class performance. The observed effects are surprisingly pronounced, especially when the per-class performance is considered. We investigate how "semantic homogeneity" of a class, i.e., the proximity of sentences in a semantic embedding space, influences the difficulty of its classification. The presented results have far reaching implications for efforts related to data collection and curation in the field of AI & Law. The results also indicate that enhancements to a data set could be considered, alongside the advancement of the ML models, as an additional path for increasing classification performance on various tasks in AI & Law. Finally, we discuss the need for an established methodology to assess the potential effects of data set properties.

classifier, experiment, f1-score, (14 more...)

arXiv.org Artificial Intelligence

2201.06653

Country:

North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > India (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback